Qualitative variables take values in an unordered set \(\mathcal{C}\), such as: \(\text{eye color} \in \{\text{brown}, \text{blue}, \text{green}\}\).
Given a feature vector \(X\) and a qualitative response \(Y\) taking values in the set \(\mathcal{C}\), the classification task is to build a function \(C(X)\) that takes as input the feature vector \(X\) and predicts its value for \(Y\), i.e. \(C(X) \in \mathcal{C}\).
Often we are more interested in estimating the probabilities that \(X\) belongs to each category in \(\mathcal{C}\).
# A tibble: 10,000 × 4
default student balance income
<fct> <fct> <dbl> <dbl>
1 No No 730. 44362.
2 No Yes 817. 12106.
3 No No 1074. 31767.
4 No No 529. 35704.
5 No No 786. 38463.
6 No Yes 920. 7492.
7 No No 826. 24905.
8 No Yes 809. 17600.
9 No No 1161. 37469.
10 No No 0 29275.
# ℹ 9,990 more rows
ggplot(data = Default, mapping =aes(x = balance, y = income, color = default) ) +geom_point() +labs(title ="Credit Default Data", x ="Balance", y ="Income" )
# Visualize income ~ balanceggplot(data = Default, mapping =aes(x = balance, y = income, color = default) ) +geom_point() +labs(title ="Credit Default Data", x ="Balance", y ="Income" )
# Visualize balance ~ defaultggplot(data = Default, mapping =aes(x = default, y = balance) ) +geom_boxplot() +labs(title ="Credit Default Data", x ="Default", y ="Balance" )
# Visualize income ~ defaultggplot(data = Default, mapping =aes(x = default, y = income) ) +geom_boxplot() +labs(title ="Credit Default Data", x ="Default", y ="Income" )
# Visualize student ~ defaultggplot(data = Default, mapping =aes(x = default, fill = student) ) +geom_bar(position ="fill") +labs(title ="Credit Default Data", x ="Default", y ="Count" )
# Load the pandas libraryimport pandas as pd# Load numpy for array manipulationimport numpy as np# Load seaborn plotting libraryimport seaborn as snsimport matplotlib.pyplot as plt# Set font sizes in plotssns.set(font_scale =1.2)# Display all columnspd.set_option('display.max_columns', None)# Import Credit Default dataDefault = pd.read_csv("../data/Default.csv")Default
default student balance income
0 No No 729.526495 44361.625074
1 No Yes 817.180407 12106.134700
2 No No 1073.549164 31767.138947
3 No No 529.250605 35704.493935
4 No No 785.655883 38463.495879
... ... ... ... ...
9995 No No 711.555020 52992.378914
9996 No No 757.962918 19660.721768
9997 No No 845.411989 58636.156984
9998 No No 1569.009053 36669.112365
9999 No Yes 200.922183 16862.952321
[10000 rows x 4 columns]
# Visualize income ~ balanceplt.figure()sns.relplot( data = Default, x ='balance', y ='income', hue ='default', style ='default', height =10 ).set( xlabel ='Balance', ylabel ='Income' );plt.show()
# Visualize balance ~ defaultplt.figure()sns.boxplot( data = Default, x ='default', y ='balance' ).set( xlabel ='Default', ylabel ='Balance' );plt.show()
# Visualize income ~ defaultplt.figure()sns.boxplot( data = Default, x ='default', y ='income' ).set( xlabel ='Default', ylabel ='Income' );plt.show()
# Visualize student ~ defaultplt.figure()sns.countplot( data = Default, x ='default', hue ='student' ).set( xlabel ='Default', ylabel ='Count' );plt.show()
3 Linear vs logistic regression
If we code default as \[
Y = \begin{cases}
0 & \text{if No} \\
1 & \text{if Yes}
\end{cases},
\] can we simply perform a linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat Y > 0.5\)?
Since \(\operatorname{E}(Y \mid X = x) = \operatorname{Pr}(Y=1 \mid X = x)\), we might think that linear regression is perfect for this task.
The issue is that linear regression may produce probabilities <0 or >1, which does not make sense.
Figure 1: Predicted probabilities by linear regression (blue line) vs logistic regression (green line).
# 0-1 (No-Yes) codingDefault['y'] = (Default['default'] =='Yes').astype(int)# Linear regression fit vs logistic regression fitplt.figure()sns.regplot( data = Default, x ='balance', y ='y', label ='Linear regression' )sns.regplot( data = Default, x ='balance', y ='y', logistic =True, label ='Logistic regression', line_kws = {'color': 'g'} ).set( xlabel ='Balance', ylabel ='Probability of Default' )plt.show()
Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms. \[
Y = \begin{cases}
1 & \text{if stroke} \\
2 & \text{if drug overdose} \\
3 & \text{if epileptic seizure}
\end{cases}.
\] This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.
Linear regression is not appropriate here. Multiclass Logistic Regression or Discriminant Analysis are more appropriate.
A bit of rearrangement gives \[
\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.
\] This monotone transformation is called the log odds or logit transformation of \(p(X)\).
The left-hand side is called the logit of \(p(X)\).
The quantity \(p(X)/[1−p(X)]\) is called the odds, and can take on any value between \([0, \infty]\).
For example, on average 1 in 5 people with an odds of \(1/4\) will default, since \(p(X) = 0.2\) implies an odds of \(0.2 = 1/4\).
We use maximum likelihood to estimate the parameters. That is find \(\beta_0, \ldots, \beta_p\) that maximizes the likelihood function \[
\ell(\beta_0, \ldots, \beta_p) = \prod_{i:y_i=1} p(x_i) \prod_{i: y_i = 0} [1 - p(x_i)].
\]
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
The area-under-ROC curve (AUC) of logistic regression (sklearn) is
from sklearn.metrics import roc_auc_scorelogit_auc = roc_auc_score( y, logit_fit.predict_proba(X)[:, 1])logit_auc
0.9070324073944639
import statsmodels.api as smimport statsmodels.formula.api as smf# Fit logistic regressionlogit_mod = smf.logit( formula ='y ~ balance + income + student', data = Default ).fit()
Optimization terminated successfully.
Current function value: 0.078577
Iterations 10
logit_mod.summary()
Logit Regression Results
Dep. Variable:
y
No. Observations:
10000
Model:
Logit
Df Residuals:
9996
Method:
MLE
Df Model:
3
Date:
Tue, 30 Jan 2024
Pseudo R-squ.:
0.4619
Time:
17:25:23
Log-Likelihood:
-785.77
converged:
True
LL-Null:
-1460.3
Covariance Type:
nonrobust
LLR p-value:
3.257e-292
coef
std err
z
P>|z|
[0.025
0.975]
Intercept
-10.8690
0.492
-22.079
0.000
-11.834
-9.904
student[T.Yes]
-0.6468
0.236
-2.738
0.006
-1.110
-0.184
balance
0.0057
0.000
24.737
0.000
0.005
0.006
income
3.033e-06
8.2e-06
0.370
0.712
-1.3e-05
1.91e-05
Possibly complete quasi-separation: A fraction 0.15 of observations can be perfectly predicted. This might indicate that there is complete quasi-separation. In this case some parameters will not be identified.
# Predicted labels from logistic regressionlogit_smpred = np.where(logit_mod.predict(X) >0.5, 'Yes', 'No')# Confusion matrix from the logistic regressionlogit_smcfm = pd.crosstab( logit_smpred, y, margins =True, rownames = ['Predicted Default Stats'], colnames = ['True Default Status'] )
Overall accuracy of logistic regression (by statsmodels)
The area-under-ROC curve (AUC) of logistic regression (statsmodels) is
logit_smauc = roc_auc_score( y, logit_mod.predict(X))logit_smauc
0.9495581233452343
We interpret the logistic regression coefficient as the expected change in log odds associated with one-unit increase in the corresponding predictor.
Equivalently, it multiplies the odds by a factor of \(e^{\beta}\).
We can measure the accuracy of the coefficient estimates by computing their standard errors.
The z-statistic (z value) in R output plays the same role as the t-statistic in the linear regression output
It is the coefficient divided by its standard error.
The null hypotheses is that the coefficient is zero, or equivalently that the odds ratio is 1.
If pvalue is small, we can reject the null hypothesis and conclude that higher balance is associated with the higher probability of default.
Wait! Why the coefficient of student is negative, contradicting with the plot?
Confounding:student status is confounded with balance. Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.
But for each level of balance, students default less than non-students.
Default %>%mutate(yhat = logit_mod$fitted.values) %>%ggplot() +geom_point(mapping =aes(x = balance, y = yhat, color = student)) +labs(x ="Credit Card Balance",y ="Default Rate",subtitle ="Even though students default less at than non-students for each level of balance, marginally their default rate is higher than for non-students." )
Figure 2: Student status confounds with credit card balance. Students tend to have higher balances than non-students. Even though students default less at than non-students for each level of balance, marginally their default rate is higher than for non-students.
Code
# Visualize balance ~ defaultplt.figure()sns.boxplot( data = Default, x ='student', y ='balance' ).set( xlabel ='Student Status', ylabel ='Credit Card Balance' );plt.show()
Code
# Add predicted probabilities to DataFrameDefault['yhat'] = logit_mod.predict()# Visualize yhat ~ balanceplt.figure()sns.relplot( data = Default, x ='balance', y ='yhat', kind ='line', hue ='student', height =8).set( xlabel ='Credit Card Balance', ylabel ='Default Rate');plt.show()
4.1 Making Predictions
After fitting the logistic regression, for an individual \(i\), we can compute the probability of default using the estimated coefficients: \[
\log \left( \frac{ \hat p(\text{balance}_i, \text{income}_i, \text{student}_i)}{1- \hat p(\text{balance}_i, \text{income}_i, \text{student}_i)} \right) = \hat \beta_0 + \hat \beta_1 \text{balance}_i + \hat\beta_2 \text{income}_i + \beta_3 \text{student}_i
\] which is \[
\hat p(\text{default}_i = \text{Yes}\mid \text{balance}_i, \text{income}_i, \text{student}_i) = \frac{e^{\hat \beta_0 + \hat \beta_1 \text{balance}_i + \hat\beta_2 \text{income}_i + \beta_3 \text{student}_i}}{1+e^{\hat \beta_0 + \hat \beta_1 \text{balance}_i + \hat\beta_2 \text{income}_i + \beta_3 \text{student}_i}}
\] then classify the individual to default if \(\hat p(\text{balance}_i, \text{income}_i, \text{student}_i) > 0.5\).
Dummy variable approach is used for qualitative predictors with the logistic regression model as well.
For example, student is a qualitative predictor with two levels, Yes and No.
We can create a dummy variable: a value of \(1\) for students and \(0\) for non-students.
Then we include this dummy variable as a predictor in the logistic regression equation in order to predict the probability of default for a given value of balance for both students and non-students, e.g., a student with a credit card balance of \(\$1,500\) and an income of \(\$40,000\) has an estimated probability of default of \[\begin{eqnarray*}
& & \hat p(\text{default = Yes} | \text{balance = 1500, income = 40,000, student = Yes}) \\
&=& \frac{e^{−10.869+0.00574\times 1500+0.003\times 40−0.6468\times 1}}{1 + e^{−10.869+0.00574\times 1500+0.003\times40−0.6468\times 1}} \\
&=& 0.058
\end{eqnarray*}\]
We can use the predict() function to predict the probability that a given observation belongs to a particular class, rather than predicting the class itself.
(p =predict(logit_mod, newdata =data.frame(balance =1500, income =40000, student ="Yes"), type ="response"))
1
0.05788194
5 Multinomial logit regression
For more than two classes, we generalize logistic regression to \[
\operatorname{Pr}(Y = k \mid X) = \frac{e^{\beta_{0k} + \beta_{1k} X_1 + \cdots + \beta_{pk} X_p}}{\sum_{\ell = 1}^K e^{\beta_{0\ell} + \beta_{1\ell} X_1 + \cdots + \beta_{p\ell} X_p}}.
\] Note each class has its own set of regression coefficients.
We first select a single class to serve as the baseline; then for each of the remaining \(K-1\) classes, we fit a separate binary logistic regression model that compares that class to the baseline. \[
p(Y=k|X=x) = \frac{e^{\beta_{k0}+\beta_{k1}x_1+\ldots+\beta_{kp}x_p}}{1+\sum_{\ell=1}^{K-1}e^{\beta_{\ell0}+\beta_{\ell1}x_1+\ldots+\beta_{\ell p}x_p}}, \quad k=1,\ldots,K-1.
\] and the baseline class is \[
p(Y=K|X=x) = \frac{1}{1+\sum_{\ell=1}^{K-1}e^{\beta_{\ell0}+\beta_{\ell1}x_1+\ldots+\beta_{\ell p}x_p}}.
\] Then we can show \[
\log\left(\frac{p(Y=k|X=x)}{p(Y=K|X=x)}\right) = \beta_{k0}+\beta_{k1}x_1+\ldots+\beta_{kp}x_p, \quad k=1,\ldots,K-1.
\]
Therefore, the interpretation of the coefficients in a multinomial logistic regression model is tied to the choice of baseline.
For example, if we set epileptic seizure to be the baseline, then a one-unit increase in \(X_j\) is associated with a \(\beta_{stroke,j}\) increase in the log odds of stroke over epileptic seizure. Stated another way, if \(X_j\) increases by one unit, then the odds of stroke over epileptic seizure \[
\frac{p(Y=\text{stroke}|X=x)}{p(Y=\text{epileptic seizure}|X=x)}
\] are multiplied by \(e^{\beta_{stroke,j}}\).
Another approach for classification is to model the distribution of \(X\) in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(\operatorname{Pr}(Y = j \mid X)\).
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem: \[\begin{eqnarray*}
p_k(x) &=& \operatorname{Pr}(Y = k \mid X = x) \\
&=& \frac{\operatorname{Pr}(X = x, Y = k)}{\operatorname{Pr}(X = x)} \\
&=& \frac{\operatorname{Pr}(X = x \mid Y = k) \cdot \operatorname{Pr}(Y = k)}{\operatorname{Pr}(X = x)} \\
&=& \frac{\pi_k f_k(x)}{\sum_{\ell=1}^K \pi_\ell f_\ell(x)},
\end{eqnarray*}\] where
\(f_k(x) = \operatorname{Pr}(X = x \mid Y = k)\) is the density of \(X\) in class \(k\)
\(\pi_k = \operatorname{Pr}(Y = k)\) is the marginal or prior probability for class \(k\).
Advantages of discriminant analysis.
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable (separation or quasi-separation). Linear discriminant analysis does not suffer from this problem.
If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
6.1 Linear discriminant analysis (LDA)
Assume \(X \in \mathbb{R}^p\) in the \(k\)-th class is distributed as as multivariate normal \(N(\mu_k, \boldsymbol{\Sigma})\) with density \[
f_k(x) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} e^{- \frac{1}{2} (x - \mu_k)^T \boldsymbol{\Sigma}^{-1} (x - \mu_k)}.
\] To classify \(X=x\), we need to see which of \(p_k(x)\) is big. Taking logs, and discarding terms that do not depend on \(k\), we just need to assign \(x\) to the class with the largest discriminant score\[
\delta_k(x) = x^T \boldsymbol{\Sigma}^{-1} \mu_k - \frac{1}{2} \mu_k^T \boldsymbol{\Sigma}^{-1} \mu_k + \log \pi_k,
\] which is a linear function in \(x\)!
For example, in the one-dimensional setting, the normal density takes the form \[
f_k(x) = \frac{1}{\sqrt{2\pi\sigma_k}} \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)
\]
where \(\mu_k\) and \(\sigma_{k}^2\) are the mean and variance parameters for the \(k\)th class. We assume shared variance term across all \(K\) classes, \(\sigma_{1}^2=\ldots=\sigma_{K}^2 = \sigma^2\). Then \[
p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi\sigma}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi\sigma}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}.
\] Taking the log of both sides and rearrange terms with \(k\)\[
\delta_k = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k.
\]
The linear discriminant function \(\delta_k(x)\) is a linear combination of the features \(x_1, \ldots, x_p\).
We estimate the unknown parameters \(\pi_k\), \(\mu_k\), and \(\boldsymbol{\Sigma}\) by \[\begin{eqnarray*}
\hat{\pi}_k &=& \frac{n_k}{n} \\
\hat{\mu}_k &=& \frac{1}{n_k} \sum_{i: y_i = k} x_i \\
\hat{\boldsymbol{\Sigma}} &=& \frac{1}{n-K} \sum_{k=1}^K \sum_{i: y_i = k} (x_i - \hat \mu_k)(x_i - \hat \mu_k)^T.
\end{eqnarray*}\]
Once we have estimated \(\hat \delta_k(x)\), we can turn these into class probabilities \[
\hat{\operatorname{Pr}}(Y = k \mid X = x) = \frac{e^{\hat \delta_k(x)}}{\sum_{\ell=1}^K e^{\hat \delta_{\ell}(x)}}.
\]
Figure 3: Here \(\pi_1 = \pi_2 = \pi_3 = 1/3\). The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers. The black line is the LDA decision boundary.
library(MASS)# Fit LDAlda_mod <-lda( default ~ balance + income + student, data = Default )lda_mod
Call:
lda(default ~ balance + income + student, data = Default)
Prior probabilities of groups:
No Yes
0.9667 0.0333
Group means:
balance income studentYes
No 803.9438 33566.17 0.2914037
Yes 1747.8217 32089.15 0.3813814
Coefficients of linear discriminants:
LD1
balance 2.243541e-03
income 3.367310e-06
studentYes -1.746631e-01
True Default Status No Yes All
Predicted Default Stats
No 9645 254 9899
Yes 22 79 101
All 9667 333 10000
The area-under-ROC curve (AUC) of LDA is
lda_auc = roc_auc_score( y, lda_fit.predict_proba(X)[:, 1])lda_auc
0.9495202246831501
Note:
Training error rates will usually be lower than test error rates, which are the real quantity of interest.
The higher the ratio of parameters \(p\) to number of samples n, the more we expect this overfitting to play a role.
Since only \(3.33\%\) of the individuals in the training sample defaulted, a simple but useless classifier that always predicts that an individual will not default, regardless of his or her credit card balance and student status, will result in an error rate of \(3.33\%\). In other words, the trivial null classifier will achieve an error rate that null is only a bit higher than the LDA training set error rate.
However this can change, depending how we define default. In the two-class case, we often assign an observation to the default class if \[
p(\text{default = Yes}|X = x) > 0.5
\]
But if we are interested in minimizing the number of defaults that we fail to identify, then we should instead assign an observation to the default class if \[
p(\text{default = Yes}|X = x) > 0.2.
\]
6.2 Quadratic discriminant analysis (QDA)
In LDA, the normal distribution for each class shares the same covariance \(\boldsymbol{\Sigma}\).
If we assume that the normal distribution for class \(k\) has covariance \(\boldsymbol{\Sigma}_k\), then it leads to the quadratic discriminant analysis (QDA).
The discriminant function takes the form \[
\delta_k(x) = - \frac{1}{2} (x - \mu_k)^T \boldsymbol{\Sigma}_k^{-1} (x - \mu_k) + \log \pi_k - \frac{1}{2} \log |\boldsymbol{\Sigma}_k|,
\] which is a quadratic function in \(x\).
Figure 4: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem. Left: \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2\). Right: \(\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2\).
# Fit QDAqda_mod <-qda( default ~ balance + income + student, data = Default )qda_mod
Call:
qda(default ~ balance + income + student, data = Default)
Prior probabilities of groups:
No Yes
0.9667 0.0333
Group means:
balance income studentYes
No 803.9438 33566.17 0.2914037
Yes 1747.8217 32089.15 0.3813814
qda_auc = roc_auc_score( y, qda_fit.predict_proba(X)[:, 1])qda_auc
0.9492120650701389
6.3 Naive Bayes
If we assume \(f_k(x) = \prod_{j=1}^p f_{jk}(x_j)\) (conditional independence model) in each class, we get naive Bayes.
For Gaussian this means the \(\boldsymbol{\Sigma}_k\) are diagonal.
Naive Bayes is useful when \(p\) is large (LDA and QDA break down).
Can be used for \(mixed\) feature vectors (both continuous and categorical). If \(X_j\) is qualitative, replace \(f_{kj}(x_j)\) with probability mass function (histogram) over discrete categories.
Despite strong assumptions, naive Bayes often produces good classification results.
Note : For each categorical variable a table giving, for each attribute level, the conditional probabilities given the target class. For each numeric variable, a table giving, for each target class, mean and standard deviation of the (sub-)variable.
library(e1071)# Fit Naive Bayes classifiernb_mod <-naiveBayes( default ~ balance + income + student, data = Default )nb_mod
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
No Yes
0.9667 0.0333
Conditional probabilities:
balance
Y [,1] [,2]
No 803.9438 456.4762
Yes 1747.8217 341.2668
income
Y [,1] [,2]
No 33566.17 13318.25
Yes 32089.15 13804.22
student
Y No Yes
No 0.7085963 0.2914037
Yes 0.6186186 0.3813814
nb_auc = roc_auc_score( y, nb_fit.predict_proba(X)[:, 1])nb_auc
0.9450012751967858
7\(K\)-nearest neighbor (KNN) classifier
KNN is a nonparametric classifier.
Given a positive integer \(K\) and a test observation \(x_0\), the KNN classifier first identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(\mathcal{N}_0\). It estimates the conditional probability by \[
\operatorname{Pr}(Y=j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)
\] and then classifies the test observation \(x_0\) to the class with the largest probability.
We illustrate KNN with \(K=5\) on the credit default data.
False positive rate: The fraction of negative examples that are classified as positive.
False negative rate: The fraction of positive examples that are classified as negative.
Above table is the training classification performance of classifiers using their default thresholds. Varying thresholds lead to varying false positive rates (1 - specificity) and true positive rates (sensitivity). These can be plotted as the receiver operating characteristic (ROC) curve. The overall performance of a classifier is summarized by the area under ROC curve (AUC).
from sklearn.metrics import roc_curvefrom sklearn.metrics import RocCurveDisplay# plt.rcParams.update({'font.size': 12})for idx, m inenumerate(classifiers): plt.figure(); RocCurveDisplay.from_estimator(m, X, y, name = sm_df.iloc[idx].name); plt.show()
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x2ac5ab390>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x2ac5fd5d0>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x2acf82050>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x2ad991590>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x2ad68ae10>
# ROC curve of the null classifier (always No or always Yes) plt.figure()RocCurveDisplay.from_predictions(y, np.repeat(0, 10000), pos_label ='Yes', name ='Null Classifier');plt.show()
See sklearn.metrics for other popularly used metrics for classification tasks.
9 Comparison between classifiers
For a two-class problem, we can show that for LDA, \[
\log \left( \frac{p_1(x)}{1 - p_1(x)} \right) = \log \left( \frac{p_1(x)}{p_2(x)} \right) = c_0 + c_1 x_1 + \cdots + c_p x_p.
\] So it has the same form as logistic regression. The difference is in how the parameters are estimated.
Logistic regression uses the conditional likelihood based on \(\operatorname{Pr}(Y \mid X)\) (known as discriminative learning).
LDA, QDA, and Naive Bayes use the full likelihood based on \(\operatorname{Pr}(X, Y)\) (known as generative learning).
Despite these differences, in practice the results are often very similar.
Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
Logistic regression is very popular for classification, especially when \(K = 2\).
LDA is useful when \(n\) is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when \(K > 2\).
Naive Bayes is useful when \(p\) is very large.
LDA is a special case of QDA.
Under normal assumption, Naive Bayes leads to linear decision boundary, thus a special case of LDA.
KNN classifier is non-parametric and can dominate LDA and logistic regression when the decision boundary is highly nonlinear, provided that \(n\) is very large and \(p\) is small.
See ISL Section 4.5 for theoretical and empirical comparisons of logistic regression, LDA, QDA, NB, and KNN.
10 Poisson regression
\(Y\) is neither qualitative nor quantitative, but rather a count.
For example, Bikeshare data set:
The response is bikers, the number of hourly users of a bike sharing program in Washington, DC.
Predictors:
mnth month of the year,
hr hour of the day, from 0 to 23,
workingday an indicator variable that equals 1 if it is neither a weekend nor a holiday,
temp the normalized temperature, in Celsius,
weathersit a qualitative variable that takes on one of four possible values: clear; misty or cloudy; light rain or light snow; or heavy rain or heavy snow.
We could do a transformation, i.e., \(\log\)\[
\log(Y) = \sum_{j=1}^p \beta_j X_j + \epsilon.
\]
But the model is no longer linear in the predictors and the interpretation of the coefficients is for \(\log(Y)\) rather than \(Y\).
10.2 Poisson regression
We often use Poisson regression to model counts \(Y\).
Poisson distribution: a random variable \(Y\) takes on nonnegative integer values, i.e. \(Y \in \{0, 1, 2, \ldots\}\). If \(Y\) follows the Poisson distribution, then \[
p(Y=k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots
\] where \(\lambda\) is the mean and variance of \(Y\), i.e., \(\mathbf{E}(Y)\).
Poisson regression model:
Assume that \(Y\) follows a Poisson distribution with mean \(\lambda\).
We model the log mean as a linear function of the predictors: \[
\log(\lambda(X_1, \ldots, X_p)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.
\] or equivalently \[
\lambda(X_1, \ldots, X_p) = \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p).
\]
To estimate the parameters \(\beta_0, \ldots, \beta_p\), we use the method of maximum likelihood.
The likelihood function is \[
\ell(\beta_0,\ldots,\beta_p) = \prod_{i=1}^n \frac{\lambda_i(x_i)^{y_i} e^{-\lambda_i}}{y_i!}
\]
We estimate the coefficients that maximize the likelihood \(\ell(\beta_0,\ldots,\beta_p)\) i.e. that make the observed data as likely as possible.
1 IRR = Incidence Rate Ratio, SE = Standard Error, CI = Confidence Interval
Interpretation: an increase in \(X_j\) by one unit is associated with a change in \(\mathbf{E}(Y ) = \lambda\) by a factor of \(\exp(\beta_j)\).
Mean-variance relationship: since under the Poisson model, \(\lambda = \mathbf{E}(Y ) = \mathbf{Var}(Y)\), we implicitly modeled variance, not a constant any more. If variance is much larger than the mean, i.e., overdispersion, we can use negative binomial regression instead.
Nonnegative fitted values: no negative predictions using the Poisson regression model. This is because the Poisson model itself assumes that the response is nonnegative.
11 Generalized linear models
We have now discussed three types of regression models: linear, logistic, and Poisson. These approaches share some common characteristics:
Each approach uses predictors \(X_1,\ldots,X_p\) to predict a response \(Y\).
We assume that, conditional on \(X_1,\ldots,X_p\), \(Y\) belongs to a certain family of distributions.
For linear regression, assume that \(Y\) follows a Gaussian or normal distribution.
For logistic regression, we assume that \(Y\) follows a Bernoulli distribution.
For Poisson regression, we assume that \(Y\) follows a Poisson distribution.
Each approach models the mean of \(Y\) as a function of the predictors.
In linear regression, the mean of \(Y\) takes the form, i.e., linear function of the predictors \[
\mathbf{E}(Y \mid X_1,\ldots,X_p) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.
\]
For logistic regression, the mean instead takes the form \[
\mathbf{E}(Y \mid X_1,\ldots,X_p) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}.
\]
For Poisson regression, the mean instead takes the form \[
\mathbf{E}(Y \mid X_1,\ldots,X_p) = \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p).
\]
Each equation can be expressed using a link function, \(\eta\), which applies a transformation to \(\mathbf{E}(Y \mid X_1,\ldots,X_p)\) so that the transformed mean is a linear function of the predictors \[
\eta(\mathbf{E}(Y \mid X_1,\ldots,X_p)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.
\]
The link function for linear regression is the identity function, \(\eta(\mu) = \mu\).
The link function for logistic regression is the logit function, \(\eta(\mu) = \log(\mu/(1-\mu))\).
The link function for Poisson regression is the log function, \(\eta(\mu) = \log(\mu)\).
The Gaussian, Bernoulli and Poisson distributions are all members of a wider class of distributions, known as the exponential family.
In general, we form a regression by modeling the response \(Y\) as from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors via \(\eta(\mathbf{E}(Y \mid X_1,\ldots,X_p)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\).
Any regression approach that follows this very general recipe known as a generalized linear model(GLM). Thus, linear regression, logistic regression, and Poisson regression are all special cases of GLMs.
Other examples include Gamma regression and negative binomial regression.